The epldata package is a set of nine comprehensive datasets covering players, teams, managers, goals and assists in the English Premier League from its incepton in August 1992 to the final week of the 2017/18 season. It is the intention to update the package annually, shortly after the end of each season
This vignette is a brief introduction to some aspects of the package and how it might be used. Several other packages are utilized. If you are unfamiliar with their functions you will need to refer to their documentation
# Download package if not on your system
#devtools::install_github("pssguy/epldata")
library(epldata)
## This lists the available datasets with a brief description
data(package="epldata")
Let’s look at one of them
library(tidyverse) # for data manipulation
glimpse(players)
#> Observations: 4,690
#> Variables: 7
#> $ player_id <chr> "TISDALP", "TODAK", "TODDA", "TODDL", "TODOROS",...
#> $ first_name <chr> "Paul", "Kazuyuki", "Andy", "Lee", "Svetoslav", ...
#> $ last_name <chr> "Tisdale", "Toda", "Todd", "Todd", "Todorov", "T...
#> $ birth_date <dttm> 1973-01-14, 1977-12-30, 1974-09-21, 1972-03-07,...
#> $ birth_city <chr> "Valletta", "Tokyo", "Derby", "Hartlepool", "Dob...
#> $ birth_country <chr> "Malta", "Japan", "England", "England", "Bulgari...
#> $ position <chr> "M", "M", "M", "D", "F", "M", "F", "F", "F", "F"...The tables are in SQL type with a key variable for linking separate datasets. For this data, it is the unique player_id, which also appears in the player_team dataset. The data provides some basic information on each of the, approaching 5,000, players who have appeared in the League
Let’s use the players data to obtain the percentage distribution of players born from 1990 onwards, by birth country
players %>%
filter(birth_date>"1989-12-31") %>%
group_by(birth_country) %>%
tally() %>%
mutate(pc=round(100*n/sum(n),2)) %>%
arrange(desc(pc))
#> # A tibble: 79 x 3
#> birth_country n pc
#> <chr> <int> <dbl>
#> 1 England 457 46.9
#> 2 France 53 5.44
#> 3 Spain 51 5.24
#> 4 Germany 33 3.39
#> 5 Netherlands 33 3.39
#> 6 Belgium 22 2.26
#> 7 Wales 22 2.26
#> 8 Brazil 18 1.85
#> 9 Ireland 18 1.85
#> 10 Scotland 16 1.64
#> # ... with 69 more rowsPredictably, England dominates but Scotland only scrapes into the top 10
Much more commonly you will need to combine tables to produce interesting information
Which player has scored the most for each team?
player_goals <- players %>%
left_join(player_team) %>%
left_join(player_game) %>%
right_join(goals) %>%
mutate(name=paste(first_name,last_name)) %>%
group_by(player_id,name,team) %>%
tally() %>%
arrange(desc(n)) %>%
group_by(team) %>%
slice(1) %>%
ungroup() %>%
filter(!(is.na(team))) %>%
select(team,name,goals=n)
player_goals
#> # A tibble: 49 x 3
#> team name goals
#> <chr> <chr> <int>
#> 1 Arsenal Thierry Henry 176
#> 2 Aston Villa Gabriel Agbonlahor 74
#> 3 Barnsley Neil Redfearn 10
#> 4 Birmingham C Mikail Forssell 29
#> 5 Blackburn Alan Shearer 112
#> 6 Blackpool DJ Campbell 13
#> 7 Bolton Kevin Davies 67
#> 8 Bournemouth Joshua King 30
#> 9 Bradford C Dean Windass 12
#> 10 Brighton Glenn Murray 12
#> # ... with 39 more rowsThe above example included quite a few joins which you will probably not wish to do for every analysis For instance, you might want to have available a summary of each match played
## goals by team for individual match
goals_by_team <- game_team %>%
left_join(player_game) %>%
right_join(goals) %>%
# sum goals for each team for each game
group_by(team,team_game_id,game_id) %>%
tally() %>%
# need to include games in which no goals were scored by team
right_join(game_team) %>%
mutate(GF=ifelse(is.na(n),0,n)) %>%
select(-c(venue,n))
goals_by_team
#> # A tibble: 20,252 x 4
#> # Groups: team, team_game_id [20,252]
#> team team_game_id game_id GF
#> <chr> <int> <int> <dbl>
#> 1 Blackburn 1 55 0
#> 2 Derby Co. 2 55 0
#> 3 Coventry C 3 56 2
#> 4 Chelsea 4 56 1
#> 5 Everton 38 57 0
#> 6 Aston Villa 39 57 0
#> 7 Man. Utd. 40 58 2
#> 8 Leicester C 41 58 2
#> 9 Middlesbro 42 59 0
#> 10 Leeds U 43 59 0
#> # ... with 20,242 more rowsSo we now have the goals scored by each team. The next step is to combine this table with itself to obtain the opposing team and the goals against
goals_by_game <-goals_by_team %>%
inner_join(goals_by_team,by="game_id") # specify otherwise it will also use team_game_id
head(goals_by_game)
#> # A tibble: 6 x 7
#> # Groups: team.x, team_game_id.x [3]
#> team.x team_game_id.x game_id GF.x team.y team_game_id.y GF.y
#> <chr> <int> <int> <dbl> <chr> <int> <dbl>
#> 1 Blackburn 1 55 0 Blackburn 1 0
#> 2 Blackburn 1 55 0 Derby Co. 2 0
#> 3 Derby Co. 2 55 0 Blackburn 1 0
#> 4 Derby Co. 2 55 0 Derby Co. 2 0
#> 5 Coventry C 3 56 2 Coventry C 3 2
#> 6 Coventry C 3 56 2 Chelsea 4 1We have duplication and wish to remove all those where team.x= team.y. as well as tidy up column names and calculate the points accrued for each match. This takes a few seconds to run
match_summary <- goals_by_game %>%
filter(team.x!=team.y) %>%
select(team=team.x,team_game_id=team_game_id.x,game_id,GF=GF.x,opponents=team.y,GA=GF.y) %>%
mutate(points=case_when(
GF >GA ~ 3,
GF==GA ~ 1,
GF<GA ~ 0
))
match_summary
#> # A tibble: 20,252 x 7
#> # Groups: team, team_game_id [20,252]
#> team team_game_id game_id GF opponents GA points
#> <chr> <int> <int> <dbl> <chr> <dbl> <dbl>
#> 1 Blackburn 1 55 0 Derby Co. 0 1
#> 2 Derby Co. 2 55 0 Blackburn 0 1
#> 3 Coventry C 3 56 2 Chelsea 1 3
#> 4 Chelsea 4 56 1 Coventry C 2 0
#> 5 Everton 38 57 0 Aston Villa 0 1
#> 6 Aston Villa 39 57 0 Everton 0 1
#> 7 Man. Utd. 40 58 2 Leicester C 2 1
#> 8 Leicester C 41 58 2 Man. Utd. 2 1
#> 9 Middlesbro 42 59 0 Leeds U 0 1
#> 10 Leeds U 43 59 0 Middlesbro 0 1
#> # ... with 20,242 more rowsTo put the results into context, we need to add the game date, arrange it sequentially and split the results into seasons. This also takes a few seconds to run
years <- c(1992:2018)
library(lubridate) # for date manipulation
match_summary_full <- match_summary %>%
left_join(games) %>%
mutate(year=year(game_date),month=month(game_date)) %>%
mutate(season= case_when(
month<=7 ~ paste(year-1,year,sep="/"),
month>7 ~ paste(year,year+1,sep="/")
)
) %>%
arrange(game_date) %>%
group_by(season,team) %>%
mutate(year_game_order=row_number())
match_summary_full
#> # A tibble: 20,252 x 14
#> # Groups: season, team [526]
#> team team_game_id game_id GF opponents GA points
#> <chr> <int> <int> <dbl> <chr> <dbl> <dbl>
#> 1 Arsenal 20000 1313 2 Norwich C 4 0
#> 2 Chelsea 20001 1314 1 Oldham 1 1
#> 3 Coventry C 20002 1315 2 Middlesbro 1 3
#> 4 Crystal P 20003 1316 3 Blackburn 3 1
#> 5 Everton 20004 1317 1 Sheff. Wed. 1 1
#> 6 Ipswich T 20005 1318 1 Aston Villa 1 1
#> 7 Leeds U 20006 1319 2 Wimbledon 1 3
#> 8 Sheff. Utd. 20007 1320 2 Man. Utd. 1 3
#> 9 Southampton 20008 1321 0 Tottenham H 0 1
#> 10 Norwich C 21000 1313 4 Arsenal 2 3
#> # ... with 20,242 more rows, and 7 more variables: game_date <dttm>,
#> # crowd <int>, referee_name <chr>, year <dbl>, month <dbl>,
#> # season <chr>, year_game_order <int>This might be a useful derived table to save as a basis for further analyses including
We can now create a standings data.frame for each round of matches based on points, Goal difference , and Goals For
standings <- match_summary_full %>%
select(team,season,game_date,year_game_order,GF,GA,points) %>%
group_by(team,season) %>%
mutate(cum_points=cumsum(points),cum_GF=cumsum(GF),cum_GA=cumsum(GA),cum_GD=cum_GF-cum_GA) %>%
group_by(season,year_game_order) %>%
arrange(desc(cum_points),desc(cum_GD),desc(cum_GF),team) %>%
mutate(position=row_number()) %>%
select(season,team,round=year_game_order,position,GF=cum_GF,GA=cum_GA,GD=cum_GD,points=cum_points) %>%
ungroup() # important otherwise scres up later inc animation
standings
#> # A tibble: 20,252 x 8
#> season team round position GF GA GD points
#> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2017/2018 Man. City 38 1 106 27 79 100
#> 2 2017/2018 Man. City 37 1 105 27 78 97
#> 3 2004/2005 Chelsea 38 1 72 15 57 95
#> 4 2017/2018 Man. City 36 1 102 26 76 94
#> 5 2004/2005 Chelsea 37 1 71 14 57 94
#> 6 2017/2018 Man. City 35 1 102 26 76 93
#> 7 2016/2017 Chelsea 38 1 85 33 52 93
#> 8 1993/1994 Man. Utd. 42 1 80 38 42 92
#> 9 2004/2005 Chelsea 36 1 68 13 55 91
#> 10 1999/2000 Man. Utd. 38 1 97 45 52 91
#> # ... with 20,242 more rowsIt is then a simple matter to create a function to get a table for any round of any year. e.g after 20 games in 1994/1995
table_year_round <- function(x,y){
standings %>%
filter(season==x,round==y)
}
table_year_round("1994/1995",20)
#> # A tibble: 22 x 8
#> season team round position GF GA GD points
#> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1994/1995 Blackburn 20 1 44 16 28 46
#> 2 1994/1995 Man. Utd. 20 2 39 16 23 44
#> 3 1994/1995 Newcastle U 20 3 39 22 17 39
#> 4 1994/1995 Liverpool 20 4 36 19 17 36
#> 5 1994/1995 Nottm Forest 20 5 33 20 13 36
#> 6 1994/1995 Leeds U 20 6 29 25 4 33
#> 7 1994/1995 Norwich C 20 7 19 17 2 30
#> 8 1994/1995 Tottenham H 20 8 34 34 0 29
#> 9 1994/1995 Chelsea 20 9 28 26 2 28
#> 10 1994/1995 Man. City 20 10 31 34 -3 28
#> # ... with 12 more rowsObviously you can vary what is in these derived tables to suit your own requirement
For the premiersoccerstats web site, I create around thirty derived tables weekly for speedy user interaction
We have previously covered basic usage of datasets within the package including combining them to produce answers to questions and creating derived tables.
We will next look at more interesting output in the form of
This package is particularly suited to the first two options though there is some geographic data to play around with
You will need the the data.frames created earlier so if it they are not in your environment either load a saved version or re-run the code
I tend to use the DT package, but there are other options
Let’s use the match_summary_full dataframe to calculate each team’s head to head record. Over and above the current data, we need to create and sum the results
match_summary_full %>%
ungroup() %>% #match_summary_full is grouped tbl_df
group_by(team,opponents) %>%
mutate(result = case_when(
GF > GA ~ "W", #win
GF == GA ~ "D", #draw/tie
GF < GA ~ "L" # loss
)) %>%
select(team,opponents,result,GF,GA,points) %>%
mutate(yesno = 1) %>%
distinct %>%
spread(result, yesno, fill = 0) %>%
summarize(P=n(),W=sum(W),D=sum(D),L=sum(L),ppg=round(sum(points)/P,2))%>%
arrange(desc(ppg)) %>%
DT::datatable(class='compact stripe hover row-border order-column',rownames=FALSE,options= list(paging = TRUE, searching = TRUE,info=FALSE))This provides a sortable, searchable table
Let’s turn attention to players. Firstly I will create a data.frame for the goals and assists for a specified player
For ease of use below, I have created it as a function and provided an example player_id
player_game_data <- function(player) {
# collect goal information for specific player
df_goals <- players %>%
left_join(player_team) %>%
left_join(player_game) %>%
left_join(goals) %>%
filter(start==TRUE|time_on>0) %>%
select(player_id,last_name,player_game_id,goal_id,team_game_id) %>%
mutate(goal=ifelse(!is.na(goal_id),1,0)) %>%
group_by(player_id,last_name,team_game_id) %>%
summarize(tot_goals=sum(goal)) %>%
filter(player_id==player)
# likewise with assists
df_assists <- players %>%
left_join(player_team) %>%
left_join(player_game) %>%
left_join(assists) %>%
filter(start==TRUE|time_on>0) %>%
select(player_id,last_name,team_game_id,assist_id,player_game_id) %>%
mutate(assist=ifelse(!is.na(assist_id),1,0)) %>%
group_by(player_id,last_name,team_game_id) %>%
summarize(tot_assists=sum(assist)) %>%
filter(player_id==player)
# combine
df_all <- df_goals %>%
inner_join(df_assists) %>%
# create a game order
left_join(game_team) %>%
left_join(games) %>%
arrange(game_date) %>%
mutate(player_game_order=row_number()) %>%
ungroup() %>% #removes unwanted name and PLAYERID
select(player_game_order,tot_goals,tot_assists) %>%
# gather into narrow format for plotting
gather(category,count,-player_game_order)
}
player_df <-player_game_data("SALAHM")
head(player_df)
#> # A tibble: 6 x 3
#> player_game_order category count
#> <int> <chr> <dbl>
#> 1 1 tot_goals 0
#> 2 2 tot_goals 0
#> 3 3 tot_goals 1
#> 4 4 tot_goals 0
#> 5 5 tot_goals 1
#> 6 6 tot_goals 0You can see why you might want to create a derived player table first if you want to do varied detailed analyses particularly where the raw data is only updated annually .saves time and enhances user interactivity experience
Now just choose your plotting package of choice to display the data. I will use plotly as this allows for ease of info-activity including feature such as panning/zooming, hover, tooltips etc.
library(plotly)
player_df %>%
plot_ly(x=~player_game_order, y= ~count,width = 600) %>%
add_bars(color= ~category, colors=c("red","blue")) %>%
layout(barmode="stack")Lots of customization is available within the package.
Lets use the data to create some interactive output
Lets say we use the match_summary_full data to plot a histogram of the goals scored by a team in the Premier League
I have set the eval = FALSE as appshot of Shiny app objects is not yet supported. but the code will run and give a similar output to that of the Crosstalk example below
library(shiny)
library(glue)
shinyApp(
ui = fluidPage(
## calculate an ordered vector of teams to select from
teams <- match_summary_full %>%
pull(team) %>%
unique() %>%
sort(),
selectInput("team", "Select Team:", teams),
plotlyOutput("goals_for")
),
server = function(input, output) {
output$goals_for <- renderPlotly({
match_summary_full %>%
filter(team == input$team) %>%
plot_ly %>%
add_histogram(x = ~ GF) %>%
layout(title = glue("Distribution of Goals scored by {input$team}"))
})
}
)This is an alternative method which does not require access to a server and allows for htmlwidgets to interact with each other
Here is the equivalent input selection(minus a default) and chart as the shiny example above produces
library(crosstalk)
msf <- SharedData$new(match_summary_full)
bscols(
widths = c(12), # forces components into rows
filter_select(id="team",label="Select a Team",sharedData=msf, group = ~team, multiple = FALSE),
plot_ly(msf, x = ~GF, showlegend = FALSE, width=600) %>%
add_histogram(color = ~team, colors = "red")
)If you want eye-candy… We can use the standings dataset prepared earlier Let’s look at how arch-rivals, Brighton and Crystal Palace, fared in 2017/18
# function to add cumulative line
# courtesy Carson Sievert
accumulate_by <- function(dat, var) {
var <- lazyeval::f_eval(var, dat)
lvls <- plotly:::getLevels(var)
dats <- lapply(seq_along(lvls), function(x) {
cbind(dat[var %in% lvls[seq(1, x)], ], frame = lvls[[x]])
})
dplyr::bind_rows(dats)
}
# select team(s) to display
teams <- c("Brighton","Crystal P")
# add function to base data and year of interest
df <- standings %>%
filter(season=="2017/2018"&team %in% teams) %>%
accumulate_by(~round)
# static plot - scatter plot- uncolored
base <- df %>%
plot_ly(x=~round,y=~position,width=600) %>%
layout(
xaxis=list(title="Games Played"),
yaxis=list(title="League Standing",range=c(20.5,0.5))
) %>%
config(displayModeBar = F,showLink = F)
# add animation options and color-blind safe colors
base %>%
add_lines(color = ~team, colors="Set2", frame = ~frame, ids = ~team) %>%
animation_opts(500, easing = "linear",mode='immediate') %>%
animation_button(
x = 1, xanchor = "right", y = 0, yanchor = "middle", font = list(color="red"), bgcolor="lightblue"
) %>%
animation_slider(
currentvalue = list(prefix = "Game ")
) Brighton, a promoted club, were expected to be struggle but Crystal Palace spent more of the season in the drop zone. In the end, they both survived relegation by placing higher than 18th